https://github.com/nimirum/IODS-project
This week’s topic was linear regression and its extension, multiple linear regression.
I installed ggplot2 and GGally for the data analysis and read the students2014 data into R.
library(ggplot2)
library(GGally)
lrn14 <- read.csv('data/learning2014.txt',sep=',')
The data comes from a study on learning and teaching statistics, collected on the course “Johdatus yhteiskuntatilastotieteeseen, syksy 2014” (Introduction to Social Statistics, fall 2014). It has 7 different variables.
The data structure and the first 5 rows:
dim(lrn14)
## [1] 166 7
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(lrn14, n=5)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
A graphical overview of the data
ggpairs(lrn14, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
Points and attitude appear to be correlated. The stra and deep variables look roughly normally distributed. Most survey participants are close to age 20, and total points are fairly evenly distributed.
I chose three variables (attitude, stra, surf) as explanatory variables and fit a regression model with exam points as the target (dependent) variable.
model <- lm(points ~ attitude + stra + surf, data = lrn14)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Only attitude has a statistically significant relationship (***) with points, so I removed stra and surf and refit the model.
model <- lm(points ~ attitude, data = lrn14)
summary(model)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The multiple R-squared of the model is approximately 0.19 (the best possible value is 1), so attitude explains about 19% of the variation in exam points.
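As a sanity check, R-squared can be computed by hand from the residual and total sums of squares. A minimal sketch using R’s built-in cars data (not the course data):

```r
# R-squared = 1 - SS_res / SS_tot, illustrated on the built-in cars data
fit <- lm(dist ~ speed, data = cars)
ss_res <- sum(residuals(fit)^2)                 # residual sum of squares
ss_tot <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
r2_manual <- 1 - ss_res / ss_tot
all.equal(r2_manual, summary(fit)$r.squared)    # TRUE
```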
Diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
par(mfrow = c(2,2))
plot(model, which = c(1,2,5))
Of the three plots, the Normal QQ-plot gives the clearest support for the model assumptions: the residuals follow the line reasonably well.
This week’s topic was logistic regression. I read the student performance and alcohol consumption data into R.
library(tidyr); library(dplyr); library(ggplot2); library(boot)
alc <- read.csv('data/students_alc2014.csv',sep=',')
The data set comes from the UCI Machine Learning Repository (Student Performance Data Set) and contains students’ performance in mathematics and Portuguese, including alcohol consumption. The alc data is a data frame joined from the mathematics and Portuguese data. It has been modified so that ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’, and ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.
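The derivation of the two new variables can be sketched with dplyr on a toy data frame (hypothetical values, not the real data):

```r
library(dplyr)
# toy example of how alc_use and high_use are derived (made-up values)
toy <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 4, 5))
toy <- mutate(toy,
              alc_use  = (Dalc + Walc) / 2,  # average of weekday and weekend use
              high_use = alc_use > 2)        # TRUE when the average exceeds 2
toy$high_use  # FALSE TRUE TRUE
```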
A glimpse of the data and its first 5 rows:
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
head(alc, n=5)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## nursery internet guardian traveltime studytime failures schoolsup famsup
## 1 yes no mother 2 2 0 yes no
## 2 no yes father 1 2 0 no yes
## 3 yes yes mother 1 2 2 yes no
## 4 yes yes mother 1 3 0 no yes
## 5 yes no father 1 2 0 no yes
## paid activities higher romantic famrel freetime goout Dalc Walc health
## 1 no no yes no 4 3 4 1 1 3
## 2 no no yes no 5 3 3 1 1 3
## 3 yes no yes no 4 3 2 2 3 3
## 4 yes yes yes yes 3 2 2 1 1 5
## 5 yes no yes no 4 3 2 1 2 5
## absences G1 G2 G3 alc_use high_use
## 1 5 2 8 8 1.0 FALSE
## 2 3 7 8 8 1.0 FALSE
## 3 8 10 10 11 2.5 TRUE
## 4 1 14 14 14 1.0 FALSE
## 5 2 8 12 12 1.5 FALSE
I chose 4 variables and present hypotheses on how they relate to high/low alcohol consumption. The variables are failures, sex, activities and absences from school. Below are my hypotheses:
H1: High alcohol consumption relates to the number of course failures.
H2: Sex is associated with alcohol consumption.
H3: More time spent on activities implies less alcohol consumption.
H4: A high number of absences relates to high alcohol consumption.
Below is the data structure.
variables <- c("high_use","failures","activities","sex","absences")
alc_hyp_test <- select(alc, one_of(variables))
glimpse(alc_hyp_test)
## Observations: 382
## Variables: 5
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
gather(alc_hyp_test) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(fill="darkslategray4")
g1 <- ggplot(alc, aes(x = failures, fill=high_use))
g1 + geom_bar() + ggtitle("Student failures and alcohol consumption")
g2 <- ggplot(alc, aes(x = sex, fill=high_use))
g2 + geom_bar() + ggtitle("Student sex and alcohol consumption")
g3 <- ggplot(alc, aes(x = activities, fill=high_use))
g3 + geom_bar() + ggtitle("Student activities and alcohol consumption")
g4 <- ggplot(alc, aes(x = high_use, y = absences, fill=high_use))
g4 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption")
About a third of the students are categorized as high alcohol consumers. From the first chart (g1) it is hard to say whether high alcohol use affects failures or not. The second chart (g2) indicates that males consume more alcohol than females. The third chart (g3) suggests that participating in activities implies somewhat less alcohol use, but the difference is small. In the last chart (g4) absences look clearly related to high alcohol consumption.
model <- glm(high_use ~ failures + sex + absences + activities -1, data = alc_hyp_test, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ failures + sex + absences + activities -
## 1, family = "binomial", data = alc_hyp_test)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2605 -0.8527 -0.5999 1.0786 2.1020
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## failures 0.43063 0.19104 2.254 0.02419 *
## sexF -1.74791 0.24779 -7.054 1.74e-12 ***
## sexM -0.76106 0.23390 -3.254 0.00114 **
## absences 0.09382 0.02295 4.089 4.33e-05 ***
## activitiesyes -0.34489 0.24071 -1.433 0.15191
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 529.56 on 382 degrees of freedom
## Residual deviance: 422.34 on 377 degrees of freedom
## AIC: 432.34
##
## Number of Fisher Scoring iterations: 4
High alcohol use seems to relate most strongly to sex and the number of absences, since those coefficients are significant (***). I drop activities and refit the logistic regression without it.
model <- glm(high_use ~ failures + sex + absences, data = alc_hyp_test, family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ failures + sex + absences, family = "binomial",
## data = alc_hyp_test)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1855 -0.8371 -0.6000 1.1020 2.0209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.90297 0.22626 -8.411 < 2e-16 ***
## failures 0.45082 0.18992 2.374 0.017611 *
## sexM 0.94117 0.24200 3.889 0.000101 ***
## absences 0.09322 0.02295 4.063 4.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 424.40 on 378 degrees of freedom
## AIC: 432.4
##
## Number of Fisher Scoring iterations: 4
OR <- coef(model) %>% exp
CI <- confint(model) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1491257 0.09395441 0.228611
## failures 1.5695984 1.08339644 2.294737
## sexM 2.5629682 1.60381392 4.149405
## absences 1.0977032 1.05169654 1.150848
From the odds ratios we can see that high alcohol use is associated with a higher course failure count and with more absences, and that high-use students are more likely to be male than female. These results support my hypotheses.
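The odds ratios above are simply the exponentiated model coefficients; for example, the sexM estimate from the summary can be converted by hand:

```r
# a logistic regression coefficient is a log odds ratio,
# so exponentiating it gives the odds ratio itself
b_sexM <- 0.94117   # sexM coefficient from the summary above
exp(b_sexM)         # ~2.563, matching the OR table
```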
probabilities <- predict(model, type = "response")
alc_hyp_test <- mutate(alc_hyp_test, probability = probabilities)
alc_hyp_test <- mutate(alc_hyp_test, prediction = probabilities > 0.5)
table(high_use = alc_hyp_test$high_use, prediction = probabilities > 0.5)
## prediction
## high_use FALSE TRUE
## FALSE 259 9
## TRUE 84 30
g <- ggplot(alc_hyp_test, aes(x = probability, y = high_use, col= prediction))
g + geom_point()
table(high_use = alc_hyp_test$high_use, prediction = alc_hyp_test$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67801047 0.02356021 0.70157068
## TRUE 0.21989529 0.07853403 0.29842932
## Sum 0.89790576 0.10209424 1.00000000
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc_hyp_test$high_use, prob = alc_hyp_test$probability)
## [1] 0.2434555
The model misclassifies about 24% of the observations (the training error).
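The loss function works because a logical class is coerced to 0/1 in the subtraction, so a prediction counts as wrong when the probability falls on the opposite side of 0.5. A small check with made-up values:

```r
# same loss function as above, checked on toy inputs
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5  # TRUE when prob is on the wrong side of 0.5
  mean(n_wrong)                       # proportion of wrong predictions
}
loss_func(class = c(TRUE, TRUE, FALSE, FALSE),
          prob  = c(0.9,  0.2,  0.4,  0.7))  # 0.5: two of four wrong
```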
cv <- cv.glm(data = alc_hyp_test, cost = loss_func, glmfit = model, K = 10)
cv$delta[1]
## [1] 0.2513089
10-fold cross-validation gives a slightly higher error (about 25%), which is a more realistic estimate of how the model performs on unseen data.
This week’s topic was exploring statistical data. I load the Boston data from the MASS library into R.
library(dplyr);library(tidyr);library(ggplot2)
library(boot);library(MASS);library(corrplot)
library(plotly)
#library(tidyverse)
#Load the Boston data from the MASS package.
data("Boston")
The Boston data set contains information about factors affecting housing values in suburban areas of Boston. The data includes e.g. the crime rate (crim), the full-value property tax rate per $10,000 (tax) and the pupil–teacher ratio (ptratio).
#Explore the structure and the dimensions of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
From the summary, we can see that all the variables in this data frame are numeric.
# The correlation matrix
cor_matrix<-cor(Boston) %>% round(digits = 2)
# Plotting the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b",tl.pos = "d",tl.cex = 0.6)
Above is the correlation chart of the variables. In it we can see, for example, that rad (index of accessibility to radial highways) correlates strongly and positively with tax (property tax rate), while lstat (percentage of lower-status population) correlates negatively with medv (median value of owner-occupied homes in $1000s).
The data must be scaled before any further analysis, because distance-based methods such as LDA and k-means are sensitive to differences in variable scales.
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
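scale() subtracts each column’s mean and divides by its standard deviation; the same can be verified by hand on a single vector:

```r
# scale() standardizes each column: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
by_hand <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), by_hand)  # TRUE
```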
The (scaled) crime rate is cut into four categories using its quantiles as break points.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# save it
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Randomly pick training (80%) and testing (20%) data
# choose randomly 80% of the rows
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
# create train and test set
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
Linear discriminant analysis, in which I use categorical crime rate as the target variable and all the other variables as predictor variables.
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2475248 0.2623762 0.2400990
##
## Group means:
## zn indus chas nox rm
## low 0.9296157 -0.9261779 -0.07742312 -0.8778804 0.42639628
## med_low -0.1133516 -0.2626364 -0.03610305 -0.5421757 -0.15736079
## med_high -0.3723618 0.2209324 0.13623794 0.4160474 0.07751014
## high -0.4872402 1.0172187 -0.02879709 1.0413202 -0.44966689
## age dis rad tax ptratio
## low -0.8654009 0.8775595 -0.6828149 -0.7418498 -0.47383669
## med_low -0.3147161 0.3561123 -0.5488991 -0.4555504 -0.06907037
## med_high 0.4024629 -0.3903883 -0.4217227 -0.3008765 -0.32283945
## high 0.7992979 -0.8466358 1.6371072 1.5133254 0.77958792
## black lstat medv
## low 0.37699981 -0.76251163 0.52092006
## med_low 0.34748360 -0.10125431 -0.01987651
## med_high 0.08979639 0.04694521 0.18014544
## high -0.63155403 0.87809704 -0.68475410
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09186832 0.702891216 -0.96690457
## indus 0.04436107 -0.286916791 0.29952767
## chas -0.08731176 0.004350385 0.05963200
## nox 0.34157645 -0.811906408 -1.28178167
## rm -0.06247102 -0.127700420 -0.11624383
## age 0.26237761 -0.215567366 -0.09579876
## dis -0.04724658 -0.282516751 0.23690261
## rad 3.17761478 0.970533112 -0.13015647
## tax -0.01239689 -0.013533005 0.60779828
## ptratio 0.12628938 -0.043204355 -0.26907330
## black -0.11940031 0.058631472 0.22110326
## lstat 0.23158515 -0.334926101 0.38607805
## medv 0.18376460 -0.515893403 -0.16837620
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9446 0.0421 0.0133
Plotting LDA results with a biplot
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
Calculate LDA predicting performance
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 18 6 2 0
## med_low 6 18 2 0
## med_high 0 10 9 1
## high 0 0 0 30
Loading the Boston data again and preparing it for clustering
data('Boston')
Boston <- scale(Boston)
dist_eu <- dist(Boston)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
Clustering the data with 4 cluster centroids.
km <-kmeans(Boston, centers = 4)
pairs(Boston, col = km$cluster)
Finding the optimal number of clusters by calculating the total within-cluster sum of squares (WCSS) for 1 to 10 clusters.
# calculate WCSS
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
From the plot above, the optimal number of cluster centroids seems to be 2: a good number of clusters is the point where the total WCSS drops sharply.
K-means clustering again but with 2 cluster centroids.
km <-kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
3D plot with crime categories
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
The data comes from the United Nations Development Programme’s Human Development Index (HDI) and Gender Inequality Index data sets. These two are combined into a data frame named “human”. More information about the data is available from the UN Development Programme’s HDI pages.
# dependencies
library(dplyr); library(tidyr); library(ggplot2)
library(stringr); library(FactoMineR); library(GGally); library(corrplot)
# download human data
human <- read.csv("data/human.csv")
str(human)
## 'data.frame': 155 obs. of 9 variables:
## $ X : Factor w/ 155 levels "Afghanistan",..: 105 6 134 41 101 54 67 149 28 102 ...
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : Factor w/ 155 levels "1,123","1,228",..: 132 109 124 112 113 111 103 123 108 93 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 9
head(human)
## X Edu2.FM Labo.FM Edu.Exp Life.Exp GNI Mat.Mor
## 1 Norway 1.0072389 0.8908297 17.5 81.6 64,992 4
## 2 Australia 0.9968288 0.8189415 20.2 82.4 42,261 6
## 3 Switzerland 0.9834369 0.8251001 15.8 83.0 56,431 6
## 4 Denmark 0.9886128 0.8840361 18.7 80.2 44,025 5
## 5 Netherlands 0.9690608 0.8286119 17.9 81.6 45,435 6
## 6 Germany 0.9927835 0.8072289 16.5 80.9 43,919 7
## Ado.Birth Parli.F
## 1 7.8 39.6
## 2 12.1 30.5
## 3 1.9 28.5
## 4 5.1 38.0
## 5 6.2 36.9
## 6 3.8 36.9
The structure of the data shows the 9 variables. X contains the row names, i.e. the names of the countries.
# visualize human data
summary(human)
## X Edu2.FM Labo.FM Edu.Exp
## Afghanistan: 1 Min. :0.1717 Min. :0.1857 Min. : 5.40
## Albania : 1 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25
## Algeria : 1 Median :0.9375 Median :0.7535 Median :13.50
## Argentina : 1 Mean :0.8529 Mean :0.7074 Mean :13.18
## Armenia : 1 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20
## Australia : 1 Max. :1.4967 Max. :1.0380 Max. :20.20
## (Other) :149
## Life.Exp GNI Mat.Mor Ado.Birth
## Min. :49.00 1,123 : 1 Min. : 1.0 Min. : 0.60
## 1st Qu.:66.30 1,228 : 1 1st Qu.: 11.5 1st Qu.: 12.65
## Median :74.20 1,428 : 1 Median : 49.0 Median : 33.60
## Mean :71.65 1,458 : 1 Mean : 149.1 Mean : 47.16
## 3rd Qu.:77.25 1,507 : 1 3rd Qu.: 190.0 3rd Qu.: 71.95
## Max. :83.50 1,583 : 1 Max. :1100.0 Max. :204.80
## (Other):149
## Parli.F
## Min. : 0.00
## 1st Qu.:12.40
## Median :19.30
## Mean :20.91
## 3rd Qu.:27.95
## Max. :57.50
##
plot(human, col="blue")
The summary shows the distributions of the variables. X is, as described, just the list of country names. Edu2.FM is the female/male ratio of secondary education and Labo.FM the female/male ratio of labour force participation. Life.Exp shows the average life expectancy in each country, and Edu.Exp the expected years of education.
human$GNI<- str_replace(human$GNI, pattern=",", replace ="") %>% as.numeric()
human_ <- select(human, -X)
cor(human_) %>% corrplot()
#PCA on the not standardized human data.
pca_human <- prcomp(human_)
biplot(pca_human, choices = 1:2,cex=c(0.8,1),col = c("grey40", "deeppink2"))
#PCA on standardized/scaled variables
human_std <- scale(human_)
pca_human <- prcomp(human_std)
biplot(pca_human, choices = 1:2,cex=c(0.8,1),col = c("grey40", "deeppink2"))
Above are the biplots of the data frame after PCA. In the first, the data has not been standardized, which is why the plot has a strange shape: PCA maximizes the variance captured by the components, so without standardization the variable with the largest scale dominates the first component and the other variables pack tightly together. With standardized data the variables contribute on an equal footing and the plot is much easier to read.
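The effect of standardization can be seen on any data set with mixed scales. A sketch with R’s built-in USArrests data (not the human data): without scaling, the large-variance variable Assault dominates the first principal component, while after scaling the loadings are far more balanced.

```r
# without scaling, the variable with the largest variance (Assault)
# dominates the first principal component almost entirely
pca_raw <- prcomp(USArrests)
abs(pca_raw$rotation["Assault", "PC1"])  # close to 1

# after scaling, all four variables contribute comparably
pca_std <- prcomp(USArrests, scale. = TRUE)
abs(pca_std$rotation["Assault", "PC1"])
```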
The “tea” dataset from the FactoMineR library contains answers to a questionnaire on tea-drinking habits.
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
keep_columns<- c("Tea", "evening", "dinner", "lunch", "where")
tea_time<- select(tea, one_of(keep_columns))
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") +geom_bar() +theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Multiple Correspondence Analysis with variables Tea, Evening, Dinner, Lunch and Where.
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.275 0.221 0.209 0.204 0.185 0.157
## % of var. 19.619 15.761 14.926 14.574 13.242 11.226
## Cumulative % of var. 19.619 35.380 50.306 64.880 78.121 89.347
## Dim.7
## Variance 0.149
## % of var. 10.653
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.108 0.014 0.013 | -0.612 0.566 0.427 | -0.508 0.412
## 2 | 0.108 0.014 0.013 | -0.612 0.566 0.427 | -0.508 0.412
## 3 | 0.514 0.321 0.080 | 0.398 0.239 0.048 | 0.585 0.545
## 4 | 0.864 0.907 0.247 | -0.052 0.004 0.001 | 0.738 0.868
## 5 | -0.323 0.126 0.159 | 0.148 0.033 0.033 | -0.186 0.055
## 6 | 0.864 0.907 0.247 | -0.052 0.004 0.001 | 0.738 0.868
## 7 | 0.028 0.001 0.002 | -0.302 0.138 0.242 | -0.034 0.002
## 8 | -0.242 0.071 0.051 | -0.162 0.040 0.023 | -0.661 0.697
## 9 | -0.176 0.038 0.037 | -0.348 0.183 0.145 | 0.678 0.734
## 10 | -0.446 0.241 0.123 | -0.208 0.065 0.027 | 0.051 0.004
## cos2
## 1 0.294 |
## 2 0.294 |
## 3 0.104 |
## 4 0.180 |
## 5 0.053 |
## 6 0.180 |
## 7 0.003 |
## 8 0.378 |
## 9 0.552 |
## 10 0.002 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | -0.037 0.024 0.000 -0.363 | -0.651 9.487 0.139
## Earl Grey | -0.247 2.867 0.110 -5.746 | 0.076 0.339 0.010
## green | 1.529 18.732 0.289 9.297 | 1.015 10.270 0.127
## evening | -0.603 9.089 0.190 -7.539 | 0.694 14.991 0.252
## Not.evening | 0.315 4.752 0.190 7.539 | -0.363 7.838 0.252
## dinner | 2.039 21.192 0.313 9.673 | 0.547 1.895 0.022
## Not.dinner | -0.153 1.595 0.313 -9.673 | -0.041 0.143 0.022
## lunch | -1.104 13.026 0.210 -7.917 | 1.532 31.188 0.403
## Not.lunch | 0.190 2.239 0.210 7.917 | -0.263 5.360 0.403
## chain store | -0.032 0.047 0.002 -0.730 | -0.119 0.818 0.025
## v.test Dim.3 ctr cos2 v.test
## black -6.445 | -0.668 10.529 0.146 -6.608 |
## Earl Grey 1.770 | 0.416 10.679 0.313 9.671 |
## green 6.170 | -0.938 9.265 0.109 -5.703 |
## evening 8.678 | -0.230 1.731 0.028 -2.870 |
## Not.evening -8.678 | 0.120 0.905 0.028 2.870 |
## dinner 2.593 | 1.639 18.005 0.202 7.777 |
## Not.dinner -2.593 | -0.123 1.355 0.202 -7.777 |
## lunch 10.980 | -0.045 0.029 0.000 -0.325 |
## Not.lunch -10.980 | 0.008 0.005 0.000 0.325 |
## chain store -2.737 | -0.498 15.165 0.440 -11.472 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.297 0.222 0.318 |
## evening | 0.190 0.252 0.028 |
## dinner | 0.313 0.022 0.202 |
## lunch | 0.210 0.403 0.000 |
## where | 0.364 0.204 0.496 |
plot(mca, invisible=c("ind"), habillage = "quali")
Three categories stand out from the crowd: dinner, tea shop and green (tea). They lie far from the origin and seem to be related to each other. Almost all the other categories gather around the origin, lunch being another exception.